Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian

نویسندگان

  • Bozo Bekavac
  • Petya Osenova
  • Kiril Ivanov Simov
  • Marko Tadic
چکیده

This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. Its base are two newspaper subcorpora from larger reference corpora of Bulgarian and Croatian. In the beginning we rely on more extralinguistically-oriented, but methodologically cleaner parameters of similarity like: specific topics, pre-defined time span and data size. The idea of ‘light’ and ‘hard’ comparable corpora is introduced. At this stage we aim at producing a ‘light’ bilingual comparable corpus. The algorithm for identifying lexical similarity and aligning linguistic units is presented, and the initial experiments are outlined.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The verbal prefix o(b)– in Croatian and Bulgarian: The semantic network and challenges of a corpus–based study

This study compares the verbal prefix o(b)– in two South Slavic languages, Croatian and Bulgarian, from a cognitive linguistic perspective. We focus on the problems arising when constructing the semantic network of this polysemous prefix, particularly on 1) isolating the prefix’s meaning from the meaning of the base verb and 2) identifying core/dominant sub–meanings for all verbs and giving the...

متن کامل

MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora

The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...

متن کامل

Extracting Lay Paraphrases of Specialized Expressions from Monolingual Comparable Medical Corpora

Whereas multilingual comparable corpora have been used to identify translations of words or terms, monolingual corpora can help identify paraphrases. The present work addresses paraphrases found between two different discourse types: specialized and lay texts. We therefore built comparable corpora of specialized and lay texts in order to detect equivalent lay and specialized expressions. We ide...

متن کامل

Utilizing Citations of Foreign Words in Corpus-Based Dictionary Generation

Previous work concerned with the identification of word translations from text collections has been either based on parallel or on comparable corpora of the respective languages. In the case of comparable corpora basic dictionaries have been necessary to form a bridge between the languages under consideration. We present here a novel approach to identify word translations from a single monoling...

متن کامل

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool fo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004